ABSTRACT

Exome sequencing (exome-seq) has aided in the 
discovery of a large number of mutations in
cancers, yet challenges remain in converting 
oncogenomics data into information that is inter- 
pretable and accessible for clinical care. We con- 
structed DriverDB (http://ngs.ym.edu.tw/driverdb/), 
a database which incorporates 6079 cases of 
exome-seq data, annotation databases (such as 
dbSNP, 1000 Genome and Cosmic) and published 
bioinformatics algorithms dedicated to driver 
gene/mutation identification. We provide two 
points of view, 'Cancer' and 'Gene', to help re- 
searchers to visualize the relationships between 
cancers and driver genes/mutations. The 'Cancer' 
section summarizes the calculated results of driver 
genes by eight computational methods for a specific 
cancer type/dataset and provides three levels of 
biological interpretation for realization of the rela- 
tionships between driver genes. The 'Gene' section 
is designed to visualize the mutation information of 
a driver gene in five different aspects. Moreover, a 
'Meta-Analysis' function is provided so researchers 
may identify driver genes in user-defined
samples. The novel driver genes/mutations 
identified hold potential for both basic research 
and biotech applications. 



INTRODUCTION 

Next-generation sequencing (NGS) has greatly increased 
the identification of mutations in cancer genomes and 
allows researchers to profile the molecular characteristics 
of various cancer types. In the past few years, applying 
exome sequencing (exome-seq) in oncogenomics studies 
has become the norm (1). Also, enormous amounts of 
cancer genomics data have been generated from large- 
scale cancer projects (2) such as The Cancer Genome 
Atlas (TCGA), the International Cancer Genome 
Consortium (ICGC), the Therapeutically Applicable 
Research to Generate Effective Treatments (TARGET) 
and the Pediatric Cancer Genome Project (PCGP). 
Although NGS has already helped researchers discover 
huge amounts of aberrant events in cancer genomics, 
translating these data into information that can be easily 
interpreted and accessed is still challenging. 

Cancers are primarily caused by the accumulation of 
genetic alterations and could be characterized by 
numerous somatic mutations. However, not all of these 
mutations are involved in tumorigenesis. Only a subset 
of mutations contributes to cancer development, whereas 
others make little or no contribution. To crystallize
this concept, the terms 'driver' and 'passenger'
mutation have been coined (3). The mutations that
confer a selective growth advantage to the tumor cell are 
called 'driver' mutations (1). 'Passenger' mutations are 
defined as those which do not confer growth advantage 
but that do occur in a cell that coincidentally or subsequently
acquires a driver mutation (4). In most solid
tumors, an average of 33-66 genes with somatic mutations
were found to alter their protein products, but the count
of non-synonymous mutations varies across cancer types
(1). More than 80% of mutations are missense (1), and
these mutations vary highly in their functional impact 
depending on their position and function in the protein 
and the nature of the replacement amino acid. It remains a 
significant challenge to identify cancer driver mutations 
because many observed missense changes are neutral pas- 
senger mutations (5). Several computational algorithms 
have been developed to predict the functional impact of 
missense mutations based on concepts including evolu- 
tionary conservation, structural constraints and the 
physicochemical attributes of amino acids. In the last 
few years, machine learning methods have been developed 
to specifically predict cancer-driving deleterious mutations 
(6-8). 

A driver gene is defined as a gene whose dysfunction
will cause tumorigenesis. Vogelstein et al. have demonstrated
the fundamental difference between a driver gene
and a driver mutation (1). Numerous computational
methods to identify driver genes have been published;
algorithms such as MutsigCV (9), MuSiC (10), Simon
(11), OncodriverFM (12) and ActiveDriver (13) are
based on the mutation frequency of an individual gene 
compared with the background mutation rate. However, 
background mutation rates among different genome 
regions and patients are highly variable (9). Recent 
studies have shown that the mutation rate varies in 
normal cells by more than 100-fold within the genome 
(14) and that such variation is higher in tumor cells (15). 
To correct for this bias, MutSigCV uses patient-specific 
mutation frequency and spectrum, as well as gene- 
specific background mutation rates. OncodriverFM 
incorporates the functional impacts of mutations as add- 
itional information. ActiveDriver identifies driver genes 
with statistically significant mutation rates in phosphoryl- 
ation-specific regions. Other methods are based on the 
sub-network approach (16-24) that can identify groups 
of genes containing driver mutations directly from 
cancer mutation data either with or without prior know- 
ledge of pathways or other information of protein/genetics 
interactions. This approach is successful particularly when 
the observed frequencies of passenger and driver muta- 
tions are indistinguishable, a situation wherein single 
gene tests fail. Moreover, sub-networks are believed to 
identify cancer driver genes with low recurrence (25). 
Most sub-network-based methods, such as MEMo
(19), MDPFinder (16), Dendrix (17), Multi-Dendrix (18) 
and RME (24), identify driver genes with the characteris- 
tics of mutual exclusivity. Moreover, sub-network 
methods could additionally incorporate copy number 
variation (CNV) data for driver gene identification 
(16-19,22,24). 
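
The notion of mutual exclusivity used by these sub-network methods can be illustrated with a small sketch. It assumes one common formalization, coverage minus coverage overlap, in the style of the weight described for Dendrix; the function names and the toy sample identifiers are hypothetical, and this is not the implementation of any of the cited tools.

```python
def coverage_and_overlap(mutated):
    """mutated maps gene -> set of sample ids carrying a mutation in that gene.

    coverage: number of samples mutated in at least one gene of the set.
    overlap:  number of 'extra' hits, i.e. how often samples are covered
              more than once (0 when the genes are perfectly exclusive).
    """
    covered = set().union(*mutated.values())
    coverage = len(covered)
    overlap = sum(len(samples) for samples in mutated.values()) - coverage
    return coverage, overlap


def exclusivity_weight(mutated):
    """Coverage minus overlap: high for gene sets that cover many samples
    while rarely being mutated together in the same sample."""
    coverage, overlap = coverage_and_overlap(mutated)
    return coverage - overlap


# Toy data (hypothetical): two exclusive genes vs. two co-occurring genes.
exclusive = {"GENE_A": {1, 2}, "GENE_B": {3, 4}}
co_occurring = {"GENE_A": {1, 2}, "GENE_B": {1, 2}}

print(exclusivity_weight(exclusive))     # 4
print(exclusivity_weight(co_occurring))  # 0
```

A gene set whose mutations are spread across different patients scores higher than one whose mutations pile up in the same patients, which is the pattern these methods search for.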

In this study, we present the DriverDB database, which 
incorporates a large amount (>6000 cases) of exome-seq 
data, annotation databases (such as dbSNP (26), 1000 
Genome (27) and COSMIC (28)), and the various bio- 
informatics algorithms devoted to defining driver genes 
or mutations. DriverDB focuses on predicting driver
genes by various algorithms and provides different
aspects of the mutation profiles of an individual gene.
We provide two viewpoints, 'Cancer' and 'Gene', to
help researchers visualize the relationships
between cancers and driver genes/mutations. A 'Meta-
Analysis' function is further included in DriverDB
to allow researchers to identify driver genes of
custom-defined samples according to clinical criteria.

MATERIALS AND METHODS 

Dataset collection 

As shown in Figure 1, DriverDB includes mutation 
profiles from 6079 tumor-normal pairs, including 4397 
from TCGA, 861 from ICGC, 112 from PCGP, 238 
from TARGET and 471 from published papers (denoted 
as 'others' in Figure 1). Detailed information for the 
datasets is provided in Supplementary Table S1. The
mutation data and CNV data of these pairs were retrieved 
from the data portal of the projects or from the supple- 
mentary data of the published papers, and were then 
parsed using in-house Perl scripts. To ensure annotation 
consistency and to make the retrieval process more effi- 
cient, clinical information for each sample was manually 
curated, based on clinical data obtained as mentioned 
above. Each sample was re-annotated with 38 clinical 
characteristics. The summary of the clinical information 
is provided in Supplementary Table S2. 

Mutation annotation 

All mutations were mapped to known databases, and their 
functional impacts were predicted by numerous bioinformatics
tools shown in the Annotation module in Figure 1.
For annotating known variants, DriverDB incorporates 
the information collected from different databases 
including dbSNP, NHLBI GO ESP (29), 1000 genomes, 
COSMIC, ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/), 
NHGRI GWAS catalog (30), HGMD-PUBLIC (31) and 
OMIM (http://omim.org/). We used SnpEff (32) and VEP
(33) to predict the effect of each mutation, such as non-synonymous
coding, stop gained/lost and frame-shift.
In addition, DriverDB scores the deleterious effects and
functional impact by seven algorithms, including SIFT
(34), PolyPhen2 (35), Condel (36), LRT (37), FATHMM
(38), MutationAssessor (39) and MutationTaster (40). 
Furthermore, we scored each mutation by the number of 
algorithms that judge the mutation as deleterious (these 
numbers are denoted as 'Driver Score'). For example, 
the mutation g.178952085A>G of PIK3CA, which occurs
in >100 patients from various cancer types, was identified
as deleterious by seven algorithms; therefore, its Driver 
Score is 7. 
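
The Driver Score described above is a simple count, which can be sketched as follows. This is a minimal illustration and not the DriverDB code: the function name, the per-tool boolean verdicts and the treatment of missing predictions as "not deleterious" are all assumptions.

```python
# The seven tools named in the text.
TOOLS = ("SIFT", "PolyPhen2", "Condel", "LRT",
         "FATHMM", "MutationAssessor", "MutationTaster")


def driver_score(calls):
    """Count how many of the seven tools judge a mutation deleterious.

    `calls` maps tool name -> bool (True = deleterious). A tool with no
    entry is counted as not deleterious, which is an assumption of this
    sketch rather than documented DriverDB behavior.
    """
    return sum(1 for tool in TOOLS if calls.get(tool, False))


# Hypothetical verdicts, for illustration only.
all_deleterious = {tool: True for tool in TOOLS}
mixed = {"SIFT": True, "PolyPhen2": True, "LRT": False}

print(driver_score(all_deleterious))  # 7
print(driver_score(mixed))            # 2
```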

Driver gene identification 

DriverDB utilized eight computational methods to 
identify driver genes of cancer types (the Cancer Driver 
Gene module in Figure 1). Four methods, including
MutsigCV, Simon, OncodriverFM and ActiveDriver, are
based on mutation frequencies and utilize all mutations to
identify driver genes.

Figure 1. Schematic representation of data processing.

For the sub-network based methods, MEMo, Dendrix, 
MDPFinder and NetBox were used. We applied the 
following filters to remove mutations/genes from the 
analysis: 

• Mutations whose effect impact was identified by 
SnpEff as 'Low' or 'Modifier.' 

• Mutations denoted as common and not recorded in 
disease/clinical-related databases according to 
Mutation annotation. 

• Potentially spurious genes reported by several studies 
(9,18). 

Detailed criteria for each method are described in 
Supplementary Methods. 
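
The three filters listed above can be expressed as a single predicate. The sketch below is illustrative only: the `Mutation` record, its field names and the placeholder blacklist entry `GENE_X` are hypothetical, and the actual criteria are those given in the Supplementary Methods.

```python
from dataclasses import dataclass


@dataclass
class Mutation:
    gene: str
    snpeff_impact: str   # SnpEff impact category, e.g. 'HIGH' or 'LOW'
    is_common: bool      # denoted as a common variant during annotation
    in_disease_db: bool  # recorded in a disease/clinical-related database


# Placeholder blacklist; the real list of potentially spurious genes
# comes from the studies cited in the text (9,18).
SPURIOUS_GENES = {"GENE_X"}


def keep_for_subnetwork(m):
    """Return True if the mutation passes all three filters."""
    if m.snpeff_impact.lower() in {"low", "modifier"}:
        return False  # filter 1: low-impact effect
    if m.is_common and not m.in_disease_db:
        return False  # filter 2: common and not disease-related
    if m.gene in SPURIOUS_GENES:
        return False  # filter 3: potentially spurious gene
    return True


# Hypothetical input records.
mutations = [
    Mutation("GENE_A", "HIGH", False, False),
    Mutation("GENE_B", "LOW", False, False),
    Mutation("GENE_C", "MODERATE", True, False),
    Mutation("GENE_D", "MODERATE", True, True),
    Mutation("GENE_X", "HIGH", False, False),
]
kept = [m.gene for m in mutations if keep_for_subnetwork(m)]
print(kept)  # ['GENE_A', 'GENE_D']
```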

Functional analysis 

For each set of driver genes identified by individual/
multiple method(s) in a group of cancer samples, we
provided three levels of biological interpretation (Gene
Ontology, Pathways and Protein/Genetic Interaction) to
help researchers understand the relationships between driver
genes. In the 'Gene Ontology' part, we used the topGO
and GeneAnswers packages of Bioconductor to calculate
the topology of the GO graph, as well as to visualize the
many-to-many relationships between GO terms and genes.
In the 'Pathway' analysis, we used collections from 
KEGG (41), PID (42), Biocarta (http://www.biocarta. 
com/), REACTOME (43) and MSigDB (44) to annotate 
driver genes. Detailed information for these eight collections
is provided in Supplementary Table S3. Three
databases, IntAct (45), BioGRID (46) and iRefIndex (47),
were used to interpret the Protein/Genetic Interaction. We
also performed the classic Fisher's exact test and utilized
-log(P value) to score each GO term and Pathway
category in the Gene Ontology and Pathway analyses.
For the 'Pathway' and 'Protein/Genetic Interaction' 
sections in the DriverDB web interface, the Cytoscape 
Web (48) tool was embedded for interactive network 
visualization. 
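
As an illustration of the enrichment scoring, the sketch below computes a one-sided Fisher's exact test P value from the hypergeometric tail and converts it to a -log score. The function names are hypothetical, and both the one-sided alternative and the base-10 logarithm are assumptions, because the text specifies neither.

```python
from math import comb, log10


def enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test P value for over-representation.

    N: genes in the background
    K: background genes annotated to the GO term or pathway
    n: driver genes in the set being tested
    k: driver genes annotated to the term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def enrichment_score(k, n, K, N):
    """-log10(P value); base 10 is an assumption of this sketch."""
    return -log10(enrichment_p(k, n, K, N))


# Toy numbers: 2 of 3 driver genes fall in a term of 4 genes, background 10.
p = enrichment_p(2, 3, 4, 10)
print(p)                             # 40/120, about 0.333
print(enrichment_score(2, 3, 4, 10)) # about 0.477
```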

WEB INTERFACE 

Cancer 

The 'Cancer' section stores the calculated results of driver
genes for a specific cancer type/dataset. First, users can
define the data type(s) incorporated for driver gene
identification (the red rectangle in Supplementary Figure
S1A) and then select a specific dataset, for example,
'Glioblastoma multiforme' (GBM). The result section
will then indicate the detailed information of the specific
dataset (red circle in Supplementary Figure S1B). Users
can select a driver gene set identified by 'N' methods
(the 'Summary' in Supplementary Figure S1B; 'N' is
determined by a drop-down menu) or by individual
methods according to the name of the method
(Supplementary Figure S1B). For 'Summary', a heat
map shows the relationship between genes and methods
(Supplementary Figure S1C; the blue color indicates genes
identified as driver genes by a method). For each driver
gene set, there is a heat map showing the mutation profile of
that driver gene set across samples (Supplementary Figure
S1D). We also performed functional analysis in three
levels of biological interpretation: 'Gene Ontology', 
'Pathway' and 'Protein/Genetics Interaction'. In the 
'Gene Ontology' analysis, I and II indicate the topology 
of GO graph by topGO and GeneAnswers, respectively, 
(Supplementary Figure S1E) whereas III and IV show the 
most significant GO terms and genes. The table in 
Supplementary Figure S1E lists the information of all 
the significant GO terms. In the 'Pathway' analysis, 
there are eight collections of gene sets from public data- 
bases including KEGG, REACTOME, MSigDB, PID 
and Biocarta. For each collection, there is a network visu- 
alization and a table displaying pathway categories of the 
driver genes that are involved (Supplementary Figure 
S1F). Finally, in the 'Protein/Genetics Interaction' part, 
the interactions between driver genes are illustrated according
to three resources: BioGRID, IntAct and
iRefIndex (Supplementary Figure S1G).
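
The 'Summary' selection, in which a gene set is defined by the number of methods that call each gene, can be sketched as a simple vote count. The function name, the method labels and the gene-to-method assignments below are hypothetical and serve only to show the selection logic.

```python
from collections import Counter


def genes_by_min_methods(results, n):
    """Return the sorted genes called as drivers by at least `n` methods.

    `results` maps a method name to the set of genes that method reports.
    """
    votes = Counter(gene for genes in results.values() for gene in set(genes))
    return sorted(gene for gene, count in votes.items() if count >= n)


# Hypothetical per-method outputs (not actual DriverDB results).
results = {
    "method_A": {"TP53", "EGFR", "PTEN"},
    "method_B": {"TP53", "EGFR"},
    "method_C": {"TP53", "NF1"},
}

print(genes_by_min_methods(results, 2))  # ['EGFR', 'TP53']
print(genes_by_min_methods(results, 3))  # ['TP53']
```

Lowering the threshold enlarges the gene set, which corresponds to choosing a smaller value in the drop-down menu.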

Gene 

In this section, researchers can visualize the mutation data 
for a specific protein encoded by a gene in five
aspects: Mutation Profile, Mutation Percentage,
Exon, Driver Score and Mutation Information 
(Supplementary Figure S2A). Here, we use the gene 
PIK3CA, which is identified as a driver gene in the 
'Cancer' section, as an example. Bar chart colors in the 
sub-figures of Supplementary Figure S2 indicate the func- 
tional impact of a mutation, such as non-synonymous and 
frame-shift shown in Supplementary Figure S2B. For 
'Mutation Profile' (Supplementary Figure S2C), a heat
map shows the mutation rate, calculated as mutation
count/sample count of a cancer, at different protein positions
across several cancer types. We also provide exon
and domain information with protein coordinates at the 
bottom of the heat map (Supplementary Figure S2C). 
Two bar charts located at the top and the left of the 
heat map indicate the sum of mutation rate according to 
protein position and cancer type, respectively. The 
'Mutation Percentage' (Supplementary Figure S2D) is 
similar to Supplementary Figure S2C, but the number in 
the heat map is calculated by the following: (mutation
count of a protein region/total mutation count of a cancer)
× 100. The heights of the two bar charts at the left and
the top of the heat map are normalized to the mutation
count of a cancer type or a protein region, respectively.
In the 'Exon' panel, the mutation counts and the mutation 
types of each exon are illustrated in Supplementary Figure 
S2E and S2F, respectively. For the 'Driver Score' part, 
Supplementary Figure S2G and S2H indicate the Driver 
Score (please see the 'Materials and Methods' section for 
details) distributions of exons and protein positions, re- 
spectively. All the mutation data of a specific protein are 
listed under 'Mutation Information' (Supplementary 
Figure S2A). 
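
The two heat map quantities used in the 'Gene' section differ only in their denominator, as the following sketch shows. The function names, region labels and counts are hypothetical values chosen for illustration.

```python
def mutation_rate(mutation_count, sample_count):
    """'Mutation Profile' value: mutation count / sample count of a cancer."""
    return mutation_count / sample_count


def mutation_percentage(region_count, total_count):
    """'Mutation Percentage' value:
    (mutation count of a protein region / total mutation count of a cancer) x 100.
    """
    return 100.0 * region_count / total_count


# Hypothetical counts for one gene in one cancer type.
region_counts = {"region_A": 5, "region_B": 15}
sample_count = 200
total = sum(region_counts.values())

rates = {r: mutation_rate(c, sample_count) for r, c in region_counts.items()}
percentages = {r: mutation_percentage(c, total) for r, c in region_counts.items()}

print(rates)        # {'region_A': 0.025, 'region_B': 0.075}
print(percentages)  # {'region_A': 25.0, 'region_B': 75.0}
```

Because the percentage is normalized to the mutations of the gene within one cancer, it highlights where mutations concentrate even when the gene is rarely mutated in that cancer.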

Meta-analysis 

In addition to the stored calculated results, 
DriverDB allows researchers to identify driver genes of a 
user-defined, specific set of samples. As shown in 
Supplementary Figure S3, users can select one or 
multiple datasets in DriverDB. We provide a list of 
clinical criteria, such as ICD-O-3 histology, tumor stage, 
distant metastasis and lymph node status, to help 
researchers to select a sub-group of well-defined cancer 
samples according to one or multiple clinical parameters 
for driver gene identification. Users can overview the 
detailed clinical information of selected samples before 
submitting this job to the server for real-time calculation. 
The user will receive a notification email with a Result ID
and can then view the driver gene results in the 'Result and
Download' section when the job is completed.
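
Selecting a sub-group of samples by clinical parameters amounts to filtering sample records on one or more fields. The sketch below is a hypothetical illustration: the field names, sample identifiers and record layout are assumptions and do not reflect the DriverDB schema.

```python
def select_samples(samples, criteria):
    """Return ids of samples matching every clinical criterion.

    `samples` is a list of dicts with a 'sample_id' key plus clinical
    fields; `criteria` maps a field name to the set of accepted values.
    """
    selected = []
    for sample in samples:
        if all(sample.get(field) in allowed
               for field, allowed in criteria.items()):
            selected.append(sample["sample_id"])
    return selected


# Hypothetical sample records.
samples = [
    {"sample_id": "S1", "lymph_node_status": "N0",
     "histology": "infiltrating duct carcinoma, NOS"},
    {"sample_id": "S2", "lymph_node_status": "N1",
     "histology": "infiltrating duct carcinoma, NOS"},
    {"sample_id": "S3", "lymph_node_status": "N0",
     "histology": "lobular carcinoma, NOS"},
]
criteria = {
    "lymph_node_status": {"N0"},
    "histology": {"infiltrating duct carcinoma, NOS"},
}

print(select_samples(samples, criteria))  # ['S1']
```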

DISCUSSION 

DriverDB makes the best of the massive amount of 
exome-seq data published in recent years by integrating 
driver gene analysis from numerous methods, as well as by 
providing visualizations of mutation information accord- 
ing to different aspects. As described in the 'Introduction' 
section, different bioinformatics algorithms have been 
developed to identify driver genes based on several 
assumptions and characteristics, each of which provides 
different points of view regarding driver genes. 
DriverDB integrates the analysis results of individual/ 
multiple method(s) and provides three levels of biological 
interpretation: Gene Ontology, Pathway and Protein/
Genetics Interaction. These visualization results will help
users to quickly grasp the relationships between driver
genes. A representative example of driver genes identified
in GBM is shown in Supplementary Figure S1. A total of
14 driver genes were identified (each gene by at least 4 
methods), and nearly all samples had at least 1 deleterious 
mutation among these 14 genes. Ten genes (CDKN2A, 
EGFR, PTEN, TP53, CDK4, PIK3R1, NF1, PIK3CA, 
RB1 and IDH1) are known to be critical in GBM tumorigenesis
(49,50). For the other four genes (ATRX, CHEK2,
CPSF6 and COL6A3), our functional analysis shows that 
they are involved in cell cycle-related categories 
(Supplementary Figure S1F). Moreover, ATRX has 
been reported as the driver gene in pediatric glioblastomas 
(51) and neuroblastomas (52,53). CHEK2 is relevant to 
familial breast/ovarian cancer (54) and neuroblastomas 
(54). CPSF6 can either enhance the invasive capacities of 
or inhibit the proliferation of cancer cells (55). The splice
variants and the aberrant methylation of COL6A3 are
also related to cancers (56-58). Genes reported in other
references but not included in our 14-gene list can be 
identified by less stringent criteria (such as those identified 
by at least three methods; for example PDGFRA, 
MDM2, MDM4 and CDKN2B). 

The 'Gene' section is designed to help researchers to 
visualize the mutation data of a driver gene. The represen- 
tative example is PIK3CA, a well-known driver gene in 
GBM as well as in other cancers (Supplementary Figure 
S2). Two hotspot mutation regions (at the middle and
the end of the protein) are readily apparent, especially
in the 'Mutation Percentage' figure (Supplementary
Figure S2D). Two well-known driver genes, BRAF
and KRAS, also have the same characteristics 
(Supplementary Figure S4). However, a driver gene may 
have distinct hotspot mutation regions in different 
cancers. For example, unlike lung cancers that carry 
EGFR mutations at the kinase domain (KD), activation 
of EGFR in GBM occurs through mutation at the extra- 
cellular domain (59). This has been noted as the reason 
that GBM with mutations in the extracellular domain 
respond poorly to EGFR inhibitors (e.g., erlotinib) that 
target the active kinase conformation (59). This phenomenon
was recapitulated by our calculation and is present in
the 'Mutation Profile' of EGFR in DriverDB 
(Supplementary Figure S5). 

In the 'Gene' section, bar chart colors indicate the func- 
tional impact of a mutation, which can help to convey 
important information. For example, FLT3 has been 
reported to be mutated in approximately one-third of 
patients in acute myeloid leukemia and has two hotspot 
regions: one consists of internal tandem duplication (ITD) 
mutations of 3-400 bp (always in-frame), and the other
consists of point mutations at aspartic acid 835 of the 
KD (60). Such mutation information for FLT3 can be 
easily obtained in DriverDB (Supplementary Figure S6). 

Several studies have assessed the performance of 
existing tools for predicting deleterious mutations, and 
the results have demonstrated that identifying cancer- 
driving mutations remains a significant challenge (5,61). 
Hence, we used the 'Driver Score', which integrates the 
information from seven computational tools, to describe 
the deleterious level of a mutation and to highlight the 
hotspot mutation region. For example, the Driver Score 
distribution of the cancer-related gene 'MLL2' implies 
that the third region of the MLL2 protein plays a more 
important role than other positions (Supplementary 
Figure S7). In summary, in the 'Gene' section of 
DriverDB, researchers can easily be informed when muta- 
tions are concentrated in one/some specific protein pos- 
ition(s)/domain(s)/exon(s)/cancer(s). 

The 'Meta-analysis' section allows a user to re-define a 
group of samples from one/multiple datasets and then 
identify driver genes for selected samples. It has been 
noted that mutations are accumulated during tumor 
progression. Different driver mutations may be used to 
convert a normal cell to a tumor cell, or to turn a 
benign tumor into a malignant one. The timing of muta- 
tions is relevant to metastasis, and there are mutations 
that occur during this process (1). Thus, if we could
define samples by a clear biological or clinical goal,
we would have the opportunity to identify a specific set 
of driver genes for a distinct question. To achieve this, 
DriverDB offers a list of clinical characteristics to define 
samples and provides a high degree of freedom for 
researchers to utilize the huge amount of sequencing 
data. For example, in Supplementary Figure S3 we
selected only 180 samples from the TCGA breast cancer
project. Their lymph node pathologic spread and ICD-O-3
histology are 'N0' and 'infiltrating duct carcinoma,
NOS', respectively.

A number of databases and frameworks have been 
developed to integrate large-scale genomic data (2), 
including cBioportal (62,63) and IntOGen (64). 
cBioportal contains datasets from TCGA and provides 
gene-based search capabilities to interactively explore 
multidimensional cancer genomics data. IntOGen is a 
framework that integrates multidimensional data for the 
identification of genes and biological modules involved in 
cancer development. DriverDB incorporates large-scale
data mining with multiple driver gene algorithms in one resource,
presents summarized driver genes, and provides several
aspects for mutation visualization. Another
unique part of DriverDB is that it also helps researchers
to identify driver genes in a user-defined manner.

NGS has become the norm for large-scale cancer 
research, and cancer exome-seq results will accumulate 
rapidly in the next few years. For example, TCGA will 
examine over 11,000 samples for 20 cancer types by 
the end of 2014. Due to the Publication Guidelines of 
TCGA (http://cancergenome.nih.gov/abouttcga/policies/ 
publicationguidelines), parts of data from TCGA are 
excluded from DriverDB. As publication restrictions are lifted, data from
TCGA, as well as from other cancer projects and the literature,
will be incorporated into updated versions of DriverDB. We envision that
these novel driver genes or mutations identified and 
stored in DriverDB will hold great potential for both 
basic research and biotech product development. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 